Sources of Data

Kerry Back

Price and volume data

  • Free sources like Yahoo usually suffer from survivorship bias
    • Only have data on currently existing companies
    • Will miss poor returns of companies that delisted
  • Bloomberg, FactSet, Capital IQ, …
  • Academic work usually uses CRSP data
    • Center for Research in Security Prices at U. Chicago
    • Dates back to 1926
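
A small numerical illustration of survivorship bias, using made-up returns rather than real data. The tickers, returns, and delisting flags below are all hypothetical; the point is only that averaging over surviving firms overstates the return an investor would have earned on the full set of firms.

```python
from statistics import mean

# Hypothetical one-year returns for five made-up firms; two of them delisted.
firms = [
    {"ticker": "AAA", "ret": 0.12, "delisted": False},
    {"ticker": "BBB", "ret": 0.08, "delisted": False},
    {"ticker": "CCC", "ret": -0.60, "delisted": True},
    {"ticker": "DDD", "ret": 0.05, "delisted": False},
    {"ticker": "EEE", "ret": -0.90, "delisted": True},
]

# Average over every firm that existed at the start of the year.
full_sample_mean = mean(f["ret"] for f in firms)

# Average over survivors only, which is all a survivorship-biased source shows.
survivor_mean = mean(f["ret"] for f in firms if not f["delisted"])

print(f"all firms:      {full_sample_mean:.3f}")
print(f"survivors only: {survivor_mean:.3f}")
```

In this toy sample the survivors-only average is positive while the full-sample average is negative, because the two delisted firms carry the worst returns.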

Company financials and actions

  • Yahoo only goes back 5 years
  • Could scrape SEC’s EDGAR site
  • FactSet, Nasdaq Data Link, and others
  • Academic work usually uses Compustat (Capital IQ)
    • Dates back to the 1960s

Trade data

  • Corporate insiders
  • Short interest
  • Quarterly fund filings
  • Retail order flow (buy and sell)

Macroeconomic data

  • Federal Reserve Economic Data (FRED)
  • Energy Information Administration (EIA)
  • World Bank, …

Sentiment data

  • Scrape social media or buy/scrape news
  • Extract mentions of tickers
  • Use machine-learning NLP (natural language processing) to classify as positive, negative, or neutral
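
The steps above can be sketched on a few made-up posts. This is a minimal illustration, not a production pipeline: the posts are hypothetical, tickers are assumed to appear as cashtags such as `$CVX`, and a toy word list stands in for a trained NLP classifier.

```python
import re

# Hypothetical posts; real input would come from a social-media or news feed.
posts = [
    "Loving the $CVX earnings beat, strong quarter",
    "$XOM looks weak after the downgrade",
    "Holding $CVX and $XOM through the week",
]

# Assumed cashtag format: a dollar sign followed by 1-5 capital letters.
CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")

# Toy word lists standing in for a trained NLP classifier.
POSITIVE = {"beat", "strong", "loving"}
NEGATIVE = {"weak", "downgrade", "miss"}


def extract_tickers(text):
    """Return the tickers mentioned as cashtags, in order of appearance."""
    return CASHTAG.findall(text)


def classify(text):
    """Label text positive, negative, or neutral by counting word-list hits."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


labeled = [(extract_tickers(p), classify(p)) for p in posts]
for tickers, label in labeled:
    print(tickers, label)
```

In practice the `classify` function would be replaced by a model trained on labeled text; the extraction and labeling steps keep the same shape.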

Image data

  • Satellite and drone imagery
  • Warehouse truck activity, cars in parking lots, …
  • Use machine learning/AI to analyze images

Consumer data

  • Search engine traffic
  • Store traffic
  • Retail sales

Yahoo adjusted close

  • Yahoo’s adjusted close is split and dividend adjusted.
  • Pct change in adjusted close \(\sim\) close-to-close return
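  • On ex-dividend days, 1 + pct change in adj close is

\[P_t / (P_{t-1}-D_t)\]

where \(P_t\) is the close on the ex-dividend day and \(D_t\) is the dividend per share. This is not exactly equal to, but is close enough to, the gross return including the dividend,

\[(P_t + D_t)/P_{t-1}\]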
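
To see how close the two expressions are, here is a quick check with made-up numbers. The prices and dividend are hypothetical and chosen only for illustration.

```python
# Hypothetical prices and dividend, chosen only for illustration.
p_prev = 100.0  # close on the day before the ex-dividend date, P_{t-1}
p_now = 99.5    # close on the ex-dividend date, P_t
div = 1.0       # dividend per share, D_t

# Expression implied by the adjusted close: P_t / (P_{t-1} - D_t)
adj_close_gross = p_now / (p_prev - div)

# Close-to-close expression with the dividend included: (P_t + D_t) / P_{t-1}
total_gross = (p_now + div) / p_prev

print(f"adjusted-close version: {adj_close_gross:.6f}")
print(f"total-return version:   {total_gross:.6f}")
print(f"difference:             {adj_close_gross - total_gross:.6f}")
```

With a dividend of about 1% of the price, the two versions differ by well under one basis point in this example.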

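Returns from Yahoo in Python

from pandas_datareader import DataReader as pdr

# Daily Chevron (CVX) data from Yahoo, starting in 2010
data = pdr('cvx', 'yahoo', start=2010)
# Adjusted close is split and dividend adjusted
price = data['Adj Close']
# Daily returns as the percent change in adjusted close
ret = price.pct_change()
ret.plot()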

Fed funds rate from FRED in Python

# Daily effective federal funds rate (FRED series DFF), starting in 2010
fed = pdr('dff', 'fred', start=2010)
fed.plot()

VIX from FRED in Python

# Daily VIX close (FRED series VIXCLS), starting in 2010
vix = pdr('vixcls', 'fred', start=2010)
vix.plot()

Monthly returns and characteristics

  • We’ll use monthly data 2000-2021.
  • Monthly returns
  • 100+ stock characteristics known at the beginning of each month
  • Mimic trading monthly
    • Form portfolio at beginning of month
    • Observe returns and changes in characteristics
    • Form new portfolio, …
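
The monthly loop above can be sketched on synthetic data. Everything here is an assumption made for illustration: the panel is random, the column names (`month`, `ticker`, `signal`, `ret`) are hypothetical, and the rule of equal-weighting the five highest-signal stocks is just one possible way to form a portfolio.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic panel: 3 months x 20 stocks, with one made-up characteristic
# ("signal") known at the start of each month and that month's return.
months = ["2000-01", "2000-02", "2000-03"]
rows = []
for month in months:
    for i in range(20):
        rows.append(
            {
                "month": month,
                "ticker": f"S{i:02d}",
                "signal": rng.normal(),
                "ret": rng.normal(0.01, 0.05),
            }
        )
panel = pd.DataFrame(rows)


def top_n_return(group, n=5):
    """Equal-weight the n stocks with the highest start-of-month signal."""
    picks = group.nlargest(n, "signal")
    return picks["ret"].mean()


# One portfolio per month: form it, observe its return, then re-form.
portfolio_ret = pd.Series(
    {month: top_n_return(group) for month, group in panel.groupby("month")}
)
print(portfolio_ret)
```

Because the portfolio is chosen using only the start-of-month `signal`, each month's return is observed after the portfolio is formed, which is the timing the monthly data is meant to mimic.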

  • Variable definitions are in ghz-predictors.xlsx
  • Data is on a SQL server at CloudClusters.net